Neighborhood Recommender: New York vs Toronto

Introduction

The goal of this project is to build a neighborhood recommendation system, demonstrated with New York and Toronto data sets. It is aimed at people who want to move from one city to another for personal or business reasons. The system compares the neighborhoods of Toronto and New York by exploring the venues in each neighborhood. It gives individuals and businesses a method for choosing the neighborhood in the target city that best suits their living or business style, based on their preferred neighborhood in their current city. The comparison is done by calculating a similarity index between the preferred neighborhood and each neighborhood of the other city, which makes it possible to list the target city's neighborhoods from most similar to most dissimilar. Although the project focuses only on Toronto and New York, the idea can be applied to any set of cities.

Data

Previously in the course, we studied two neighborhood data sets: Toronto and New York. This project uses these two data sets together with venue data obtained from the Foursquare API. The Foursquare venue data is used to create two data sets, one for New York neighborhoods and one for Toronto neighborhoods. These are then transformed by venue category so that each category becomes a column, as shown below.

image.png
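The transformation into category columns can be sketched as follows. This is a minimal illustration rather than the notebook's exact code: the venue rows and the column names "Neighborhood" and "Venue Category" are made up, and averaging the one-hot columns per neighborhood (so each cell is the share of venues in that category) is an assumption about how the values are computed.

```python
import pandas as pd

# Hypothetical venue rows, one per venue (column names are assumptions).
venues = pd.DataFrame({
    "Neighborhood": ["Bensonhurst", "Bensonhurst", "Bensonhurst",
                     "Bay Ridge", "Bay Ridge"],
    "Venue Category": ["Sushi Restaurant", "Chinese Restaurant",
                       "Sushi Restaurant", "Park", "Pizza Place"],
})

# One column per category, 1.0 where the venue belongs to that category.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# Average per neighborhood: each cell becomes the share of that
# neighborhood's venues that fall into the category (assumed weighting).
profile = onehot.groupby("Neighborhood").mean()
print(profile)
```

Each row of `profile` is then one neighborhood's vector, with one entry per category.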

The category columns are used as vectors to calculate the similarity index, so both data sets must have the same set of categories. The similarity index is simply the Pearson coefficient between the preferred neighborhood and each neighborhood in the target city. The comparison result is saved into a new data set that lists each neighborhood in the target city along with its similarity index to the preferred neighborhood in the home city. Below is a screenshot of what the result data frame looks like.

image.png

Methodology

We measure how similar each neighborhood is to another with the Pearson correlation coefficient, which measures the strength of the linear association between two variables. The formula for the coefficient between two sets X and Y, each with N values, is shown below.

Why Pearson Correlation?

Pearson correlation is invariant to scaling and shifting: multiplying all elements by a positive constant or adding any constant to all elements leaves the coefficient unchanged (multiplying by a negative constant flips its sign). For example, for two vectors X and Y, pearson(X, Y) == pearson(X, 2 * Y + 3).
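The invariance can be checked numerically with a short sketch. The two vectors are made-up values, not data from the project.

```python
import numpy as np
from scipy.stats import pearsonr

# Made-up vectors for illustration only.
x = np.array([0.0, 0.2, 0.5, 0.1, 0.2])
y = np.array([0.1, 0.3, 0.4, 0.0, 0.2])

r_original, _ = pearsonr(x, y)
r_shifted, _ = pearsonr(x, 2 * y + 3)    # positive scale plus shift
r_flipped, _ = pearsonr(x, -2 * y + 3)   # negative scale plus shift

print(np.isclose(r_original, r_shifted))    # coefficient is unchanged
print(np.isclose(r_original, -r_flipped))   # only the sign changes
```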

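$$r = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2}}$$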

The values given by the formula range from r = -1 to r = 1, where 1 indicates a perfect positive correlation between the two entities and -1 indicates a perfect negative correlation.

In our case, a value close to 1 means that the two neighborhoods are similar, while a value close to -1 means the opposite.

Instead of writing code to calculate the Pearson coefficient ourselves, we can use the function that SciPy provides. All we have to do is pass the two vectors (the two neighborhoods) to the SciPy function pearsonr(neighborhood_1, neighborhood_2). Each vector holds the category values for one neighborhood. Therefore, we need to modify the data for both cities so that both data frames have the same categories, in the same order.
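A minimal sketch of the call, assuming two made-up category vectors that already share the same categories in the same order. The manual calculation is included only as a cross-check against the definition of the coefficient.

```python
import numpy as np
from scipy.stats import pearsonr

# Made-up category values for two neighborhoods (same categories, same order).
neighborhood_1 = np.array([0.30, 0.00, 0.10, 0.40, 0.20])
neighborhood_2 = np.array([0.25, 0.05, 0.00, 0.50, 0.20])

# pearsonr returns the coefficient and a p-value; only the first is used here.
similarity, p_value = pearsonr(neighborhood_1, neighborhood_2)

# Cross-check against the definition of the coefficient.
dx = neighborhood_1 - neighborhood_1.mean()
dy = neighborhood_2 - neighborhood_2.mean()
manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(similarity, np.isclose(similarity, manual))
```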

image.png

As shown above, there are a total of 475 unique categories. Our goal is to make sure that each city's data frame includes all of these categories. If a category does not exist in a city, we add it with a value of zero. We also need to sort the categories in both data frames so that the vectors have corresponding values.

v1 = [x1, x2, ..., xn] and v2 = [y1, y2, ..., yn], where xi and yi are the values of the same category i.

The screenshots below show how to add and sort categories.

image.png

image.png

image.png
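One possible way to add the missing categories and sort them, sketched with pandas. The data frames and their values are hypothetical, and `reindex` is one option among several; the notebook may do this differently.

```python
import pandas as pd

# Hypothetical per-neighborhood profiles with different category columns.
nyc = pd.DataFrame({"Pizza Place": [0.5, 0.0], "Deli": [0.5, 1.0]},
                   index=["Bensonhurst", "Bay Ridge"])
toronto = pd.DataFrame({"Park": [0.2], "Pizza Place": [0.8]},
                       index=["Bathurst Quay"])

# Sorted union of the categories seen in either city.
all_categories = sorted(set(nyc.columns) | set(toronto.columns))

# reindex adds the missing columns (filled with 0) and puts the
# columns of both data frames in the same order.
nyc_aligned = nyc.reindex(columns=all_categories, fill_value=0.0)
toronto_aligned = toronto.reindex(columns=all_categories, fill_value=0.0)

print(list(nyc_aligned.columns))
print(list(toronto_aligned.columns))
```

After this step, row i of either data frame can be compared with any row of the other, because position k always refers to the same category.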

Results

In the project Jupyter Notebook, a function was written that returns a data frame with the similarity index between one neighborhood in one city and each neighborhood in the target city. Below are two examples that illustrate the results.

Example 1: recommending neighborhoods in Toronto

Let's say that an individual is interested in moving from New York to Toronto and currently lives in Bensonhurst, Brooklyn, NY. The goal is to find the neighborhoods in Toronto that are most similar to Bensonhurst. This works much like a recommender system.

Let's call the similarity function, passing "nyc", "Brooklyn", and "Bensonhurst" as parameters. Below is the head of the sorted data frame returned by the function. Note that these are the five most similar neighborhoods as the data frame is sorted from the most similar to the most dissimilar. The "Similarity" column represents the similarity index, which ranges from -1 (most dissimilar) to 1 (most similar).

image.png
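The notebook's function is not reproduced here. The sketch below shows the general idea under stated assumptions: the name `rank_by_similarity` and its signature are hypothetical (the actual function takes a city, a borough, and a neighborhood name), and the input values are made up.

```python
import pandas as pd
from scipy.stats import pearsonr

def rank_by_similarity(preferred: pd.Series, target: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: rank target neighborhoods by Pearson similarity."""
    scores = []
    for _, row in target.iterrows():
        r, _ = pearsonr(preferred.to_numpy(), row.to_numpy())
        scores.append(r)
    result = pd.DataFrame({"Neighborhood": target.index, "Similarity": scores})
    # Most similar first, as in the data frame described above.
    return result.sort_values("Similarity", ascending=False).reset_index(drop=True)

# Made-up profiles that already share the same categories in the same order.
categories = ["Chinese Restaurant", "Park", "Pizza Place", "Sushi Restaurant"]
preferred = pd.Series([0.4, 0.0, 0.1, 0.5], index=categories)
target = pd.DataFrame(
    [[0.0, 0.6, 0.3, 0.1],
     [0.5, 0.1, 0.0, 0.4]],
    index=["Neighborhood A", "Neighborhood B"], columns=categories)

ranking = rank_by_similarity(preferred, target)
print(ranking.head())
```

Note that `pearsonr` is undefined for a constant vector, so a neighborhood with no venues at all would need to be filtered out before ranking.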

Let's check the most common venues in Bensonhurst and the most similar neighborhood in Toronto.
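Looking up the most common venue categories of a neighborhood could be done along these lines. The helper name `top_categories` and the values are hypothetical.

```python
import pandas as pd

def top_categories(profiles: pd.DataFrame, neighborhood: str, n: int = 5) -> pd.Series:
    """Hypothetical helper: the n categories with the highest value for one neighborhood."""
    return profiles.loc[neighborhood].nlargest(n)

# Made-up category values for two neighborhoods.
profiles = pd.DataFrame(
    {"Chinese Restaurant": [0.3, 0.1], "Park": [0.0, 0.4],
     "Pizza Place": [0.2, 0.3], "Sushi Restaurant": [0.5, 0.2]},
    index=["Neighborhood A", "Neighborhood B"])

top_a = top_categories(profiles, "Neighborhood A", n=2)
print(top_a)
```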

image.png

image.png

As you can see, both neighborhoods feature Asian venues: sushi, Chinese, and Japanese restaurants. Both are also fairly suburban neighborhoods, as shown in the two maps below.

image.png

Below is a map of Toronto neighborhoods. The darker the color, the more similar the neighborhood is to Bensonhurst.

image.png

Example 2: recommending neighborhoods in New York City

In this example, we would like to find the New York neighborhoods that are most similar to a Toronto neighborhood. The Toronto neighborhood used as the reference is Bathurst Quay, Downtown Toronto.

image.png

Let's check the most common venues in Bathurst Quay and the most similar neighborhoods in New York City.

image.png

image.png

As you can see, the two neighborhoods have a lot in common, both in their venue categories and in their location in the city: both are downtown neighborhoods, as shown in the maps below. If you look closely at the two maps, both neighborhoods also have parks nearby.

Bathurst Quay Map

image.png

New York Neighborhoods Similarity Map

image.png

Discussion

As you might have noticed in the two examples, the similarity index is fairly low (below 0.6) even for the most similar neighborhood. There are several possible reasons for that, including the large number of categories (475). More testing is also needed to check which similarity measure best fits this problem; Euclidean distance and the Jaccard coefficient are two candidates. Nevertheless, the two examples suggest that the Pearson coefficient is a reasonable approach.
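As a starting point for such testing, the three measures can be computed side by side on the same pair of vectors. This is only a sketch with made-up values; treating any nonzero value as "category present" for the Jaccard coefficient is an assumption.

```python
import numpy as np
from scipy.stats import pearsonr

# Made-up category values for two neighborhoods (same categories, same order).
a = np.array([0.4, 0.0, 0.1, 0.5, 0.0])
b = np.array([0.5, 0.1, 0.0, 0.4, 0.0])

# Pearson: similarity in [-1, 1].
pearson, _ = pearsonr(a, b)

# Euclidean: a distance, so 0 means identical and larger means less similar.
euclidean = np.linalg.norm(a - b)

# Jaccard on category presence: shared categories / categories in either.
present_a, present_b = a > 0, b > 0
jaccard = (present_a & present_b).sum() / (present_a | present_b).sum()

print(pearson, euclidean, jaccard)
```

The measures are not on the same scale (Euclidean is a distance, the other two are similarities), so comparing them means comparing the rankings they produce rather than the raw numbers.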

Conclusion

This project presented a simple yet effective approach to building a neighborhood recommender system. Although the system is based only on venue categories, more data, such as demographic data for each neighborhood, could be added to improve its recommendations. The system can also help businesses that are looking to scale up and open new branches by recommending neighborhoods similar to those of their current successful branches.